Part A

First, let's load the data using the *data.table* package:

# Data loading
library(data.table)
test = fread(file = 'test.csv')
train = fread(file = 'train.csv')

We pick 10 random observations to set aside for the second part:

set.seed(1999) # for reproducibility
m = 10

# Sample 10 indexes
indexes = sample(1:nrow(train), m)

# Set aside the 10 observations for the second part
data.sample = train[indexes]
target = data.sample$tempo
train = train[-indexes, ]

Now let's prepare our data: we don't need the id for either the dimensionality reduction or the regression, while we would like to preserve the genre feature.

# Extract the variable tempo from the train set
tempo = train$tempo

# Extract the variables id and genre from the train and test set
id_train = train$id
id_test = test$id
genre_train = train$genre
genre_test = test$genre

# Remove all these variables from the data in order to prepare it for the dimensionality reduction
train = train[, -c(7040, 7041, 7042)]
test = test[, -c(7040, 7041)]
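As a side note, the same columns could also be dropped by name rather than by hard-coded positions, which is more robust if the column layout changes. A minimal base-R sketch on a toy data frame (only the names id, genre and tempo match the real data; everything else is illustrative):

```r
# Toy data frame standing in for the real train set
df_toy = data.frame(feat1 = 1:3, id = 101:103,
                    genre = c("rock", "jazz", "rock"), tempo = c(120, 98, 140))

# Drop the non-feature columns by name instead of by position
keep = setdiff(names(df_toy), c("id", "genre", "tempo"))
df_toy = df_toy[, keep, drop = FALSE]
names(df_toy)  # "feat1"
```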

Since our target is the variable tempo, let's look at its correlation with the other features:

## [1] "Correlation with genre: -0.129448380511222"

As we can see from the plot, the linear correlations are low. We also observe two blocks with higher correlation at the beginning, corresponding to the first two frequencies of each of the 171 time chunks, and some higher correlations at the end, corresponding to the properties of the signal.
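The per-feature correlations behind the plot can be computed column-wise; here is a minimal sketch on toy data (the real computation runs over the train features and tempo, which are not reproduced here):

```r
# Toy feature matrix and target standing in for the train set and tempo
set.seed(1)
features = data.frame(f1 = rnorm(50), f2 = rnorm(50), f3 = rnorm(50))
tempo_toy = rnorm(50)

# Correlation of each feature with the target
correlations = sapply(features, function(col) cor(col, tempo_toy))
```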

Knowing this, for the KPCA we decided to keep the first 171 features, the second 171 features with a minus sign, and the mean of each of the remaining 38 blocks of frequencies.

# Take the first group of 171 features
df = as.data.frame(train[, 1:171])
df.test = as.data.frame(test[, 1:171])

# Take the second group of 171 features with a minus sign
df = cbind(df, -train[,172:342])
df.test = cbind(df.test, -test[,172:342])

# Take only the mean of the other 38 blocks of frequencies
for (i in 3:40){
  # For the train set
  df = cbind(df, rowMeans(train[, (171*(i-1)+1):(171*i)]))
}
colnames(df) = paste("Col", 1:ncol(df), sep = ".")

for (i in 3:40){
  # For the test set
  df.test = cbind(df.test, rowMeans(test[, (171*(i-1)+1):(171*i)]))
}
colnames(df.test) = paste("Col", 1:ncol(df.test), sep = ".")
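The block indexing is easy to get wrong: for blocks of width 171, block i spans columns (171*(i-1)+1):(171*i). A small sketch with blocks of width 3 instead of 171 illustrates the same arithmetic on a toy matrix:

```r
# 2 rows, 12 columns = 4 blocks of width 3
X = matrix(1:24, nrow = 2)
block_width = 3

# Mean of each row within blocks 3 and 4 (same index arithmetic as above)
block_means = sapply(3:4, function(i) {
  cols = (block_width * (i - 1) + 1):(block_width * i)
  rowMeans(X[, cols, drop = FALSE])
})
block_means  # column 1 = block-3 means, column 2 = block-4 means
```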

It's now time for the dimensionality reduction. After trying several methods and configurations, we opted for a kernel PCA with an rbfdot (Gaussian RBF) kernel, keeping only the first 3 principal components. After the dimensionality reduction, we attached the signal properties to the resulting data frame. We used the kpca function provided by the package kernlab, fixing the parameter sigma to \(0.2\).

# kPCA
library(kernlab)
kpc = kpca(~., data = df, kernel = "rbfdot",
           kpar = list(sigma = 0.2), features = 3)

# Store the obtained principal components
train_pca = pcv(kpc)

# Add colnames to the matrix 
colnames(train_pca) = paste("Comp", 1:ncol(train_pca), sep = ".")

# Add the 'genre' as factor
train_pca = cbind(train_pca, as.factor(genre_train), train[, 7034:7039])

# Project the test data on the same manifold
test_pca = predict(kpc, df.test)
colnames(test_pca) = paste("Comp", 1:ncol(test_pca), sep = ".") 
test_pca = cbind(test_pca, as.factor(genre_test), test[, 7034:7039])

Now our data are ready for a regression model. Again, after several trials, we decided to use an SVM with a radial kernel, via the method 'svmRadial' from the package caret. We also set up a 5-fold cross-validation to validate the model and tune the parameters before submitting the predictions on the test set.

# Set a 5-fold cross validation
library(caret)
ctrl = trainControl(method = "cv", number = 5)

# Run the 'svmRadial' model 
model = train(tempo ~., data = as.data.frame(cbind(train_pca, tempo)), 
              method = "svmRadial", trControl = ctrl)
model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1551 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1242, 1240, 1241, 1241, 1240 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared   MAE     
##   0.25  21.35166  0.4515537  11.87166
##   0.50  21.06324  0.4663642  11.57633
##   1.00  20.92660  0.4738745  11.40496
## 
## Tuning parameter 'sigma' was held constant at a value of 0.03546633
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.03546633 and C = 1.

In the end, our best result is obtained with \(C = 1\) and \(\sigma = 0.035\): the cross-validated \(RMSE\) is \(20.93\) and the \(R^2\) is \(0.47\). The fit of the model to the data is not perfect, but we consider it acceptable.

Let’s now create a csv file with the predictions, to submit on Kaggle:

# Predict the 'tempo' on the test set
predictions = predict(model, test_pca)

# Create the csv
result = as.data.frame(cbind(id_test, predictions))
colnames(result) = c('id', 'target')
write.csv(result, file = 'result.csv', row.names = FALSE)

Part B

We have implemented a function to split the 10 observations we previously extracted from the train set into two equal subsets.

# Function used for splitting the data into two equal subsets
split = function(data, target){
  # Input:
  # Input:
  #   - data: data to be split
  #   - target: target variable
  # Output:
  #   - List containing the data and the target split into train and test
  index = sample(1:nrow(data), nrow(data)/2)
  data.train = data[index]
  data.test = data[-index]
  target.train = target[index]
  target.test = target[-index]
  my_list = list("train" = data.train, "test" = data.test, 
                 "target_train" = target.train, "target_test" = target.test)
  return(my_list)
}
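To check the splitting logic on toy data, here is a self-contained base-R variant (the real function indexes a data.table, so it uses data[index] rather than data[index, ]); a more distinctive name is used here because calling it split would mask base R's split():

```r
# Base-R variant of the splitting function, on a toy data frame
split_data = function(data, target){
  index = sample(1:nrow(data), nrow(data) / 2)
  list(train = data[index, , drop = FALSE],
       test = data[-index, , drop = FALSE],
       target_train = target[index],
       target_test = target[-index])
}

set.seed(123)
toy = data.frame(x = 1:10)
out = split_data(toy, target = 101:110)
c(nrow(out$train), nrow(out$test))  # 5 and 5
```

Every observation ends up in exactly one of the two halves, with its target kept aligned.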

We applied to these data the same pre-processing steps used in Part A.

set.seed(123) # for reproducibility

# Split the data
data = split(data.sample, target)

# Store the results of the split
train_target = data$target_train
test_target = data$target_test
data_train = data$train
data_test = data$test 

# Store apart the features id and genre
train.genre = data_train$genre
test.genre = data_test$genre
train.id = data_train$id
test.id = data_test$id

# Remove those features both from the test and train set
data_train = data_train[,-c(7040, 7041, 7042)]
data_test = data_test[,-c(7040, 7041, 7042)]

As before, we considered the first block of 171 Mel-frequency cepstral coefficients, the second block with a minus sign and the mean for all the other 38 blocks.

# Train set
df.B = as.data.frame(data_train[, 1:171])
df.B = cbind(df.B, -data_train[,172:342])
for (i in 3:40){
  df.B = cbind(df.B, rowMeans(data_train[, (171*(i-1)+1):(171*i)]))
}
colnames(df.B) = paste("Col", 1:ncol(df.B), sep = ".")

# Test set
df.test.B = as.data.frame(data_test[, 1:171])
df.test.B = cbind(df.test.B, -data_test[,172:342])
for (i in 3:40){
  df.test.B = cbind(df.test.B, rowMeans(data_test[, (171*(i-1)+1):(171*i)]))
}
colnames(df.test.B) = paste("Col", 1:ncol(df.test.B), sep = ".")

We computed the kPCA keeping the same parameters and number of components as in the previous analysis.

This time, we decided not to include the genre variable, because it is a factor and its levels may differ between our new train and test sets.

# kPCA
kpc.B = kpca(~., data = df.B, kernel = "rbfdot",
             kpar = list(sigma = 0.2), features = 3)

# Train set
train_pca.B = pcv(kpc.B)
colnames(train_pca.B) <- paste("Comp", 1:ncol(train_pca.B), sep = ".")
train_pca.B = cbind(train_pca.B, data_train[, 7034:7039])

# Test set
test_pca.B = predict(kpc.B, df.test.B)
colnames(test_pca.B) <- paste("Comp", 1:ncol(test_pca.B), sep = ".")
test_pca.B = cbind(test_pca.B, data_test[, 7034:7039])

Again, we used an SVM regression model with a radial kernel.

# SVM Regression

# Set a 2-fold cross validation
ctrl.B = trainControl(method = "cv", number = 2)

# Run the 'svmRadial' model 
model.B = train(train_target ~., 
                data = as.data.frame(cbind(train_pca.B, train_target)), 
                method = "svmRadial", trControl = ctrl.B)
model.B
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 5 samples
## 9 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 2, 3 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared  MAE     
##   0.25  25.75751  0.636018  20.74822
##   0.50  25.05565  0.636018  20.50059
##   1.00  23.77135  0.636018  20.50071
## 
## Tuning parameter 'sigma' was held constant at a value of 0.07180026
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.07180026 and C = 1.

We notice a better \(R^2\) in the cross-validation, but this is simply because we only have 5 observations here, so the in-sample results are inflated. The best tuned values are now \(C = 1\) and \(\sigma = 0.07\).

After training the model, we used it to predict on the test set.

predictions_test = predict(model.B, test_pca.B)

We implemented functions to compute the confidence bands of the predictions. In addition, we fixed the level \(\alpha = 0.80\).

# Set alpha
alpha = 0.80

# Get k from alpha and n
k = function(alpha, n = 10){
  x = ( n/2 + 1 ) * ( 1-alpha )
  x = ceiling(x)
  if(x >= n/2) return( n/2 )
  return(x)
}

my_k = k(alpha)
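To make the choice of k concrete: with \(\alpha = 0.80\) and \(n = 10\), \(x = (10/2 + 1)(1 - 0.80) = 1.2\), so \(\lceil x \rceil = 2 < n/2 = 5\) and \(k = 2\). A standalone check (the formula is reproduced so the sketch is self-contained):

```r
# Same formula as the k() defined above, reproduced for a standalone check
k = function(alpha, n = 10){
  x = ceiling((n/2 + 1) * (1 - alpha))
  if (x >= n/2) return(n/2)
  x
}
k(0.80)  # 2: d will be the 2nd smallest of the 5 calibration residuals
```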

# Absolute Error function between the predicted values and the target feature
AE = function(preds, target){
  return(abs(preds-target))
}

# Regression function which uses the implemented model on the test data
regression_function = function(model, data){
  return(predict(model, data))
} 

# Compute and sort the residuals
residuals = function(model, test, target_test){
  # Apply the regression model
  preds = regression_function(model, test)
  # Compute the residuals
  my_residuals = AE(preds, target_test)
  return(sort(my_residuals))
}

my_residuals = residuals(model.B, test_pca.B, test_target)

# Get d
d = my_residuals[my_k]

This is the function used for the confidence bands.

# Compute the confidence bands
confidence_bands = function(model, test, d){
  bands = c()
  # Apply the regression model
  preds = regression_function(model, test)
  # Compute the bands
  for(i in 1:length(preds)){
    band = c(preds[i] - d, preds[i] + d)  
    bands = rbind(bands, band)
  }
  return(bands)
}
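Since the half-width d is the same for every observation, the loop above can also be written in vectorized form. A sketch with dummy predictions standing in for the model call (d = 17.45 is roughly the value from our run):

```r
# Dummy predictions standing in for predict(model, test)
preds = c(154.3, 170.7, 150.5)
d = 17.45

# All bands at once: prediction +/- d
bands = cbind(lower = preds - d, upper = preds + d)
```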

# Get the confidence bands
bands = confidence_bands(model.B, test_pca.B, d)

# Set row and column names
rownames(bands) = NULL; colnames(bands) = c('lower', 'upper')

bands
##         lower    upper
## [1,] 136.8690 171.7762
## [2,] 153.2728 188.1800
## [3,] 154.4304 189.3376
## [4,] 133.0196 167.9268
## [5,] 140.0000 174.9072

In the plot, we have represented the target tempo with dots and the confidence bands as segments for each song.

In the plot above, only one target tempo falls outside the predicted confidence bands. This is a good performance, but we have to keep in mind that it strongly depends on how the data were split: the model was trained on only 5 observations, so it is not really reliable.

We then extended our analysis by randomly taking 100 observations from the test set. We ran the same pre-processing steps and applied the same dimensionality reduction and model to the data.

# Take 100 random test obs
set.seed(123)
m = 100

# Sample the indexes
indexes = sample(1:nrow(test), m)
test_sample = test[indexes]

# Pre-processing of the dataset
test_sample_id = id_test[indexes]
test_sample_genre = genre_test[indexes] 

df.test.sample = as.data.frame(test_sample[, 1:171])
df.test.sample = cbind(df.test.sample, -test_sample[,172:342])

for (i in 3:40){
  df.test.sample = cbind(df.test.sample, rowMeans(test_sample[, (171*(i-1)+1):(171*i)]))
}
colnames(df.test.sample) = paste("Col", 1:ncol(df.test.sample), sep = ".")

# Apply the same kPCA as before on the 100-sampled test observations
test_sample_pca = predict(kpc.B, df.test.sample)
colnames(test_sample_pca) = paste("Comp", 1:ncol(test_sample_pca), sep = ".")
test_sample_pca = cbind(test_sample_pca, test_sample[, 7034:7039])

# Get the bands using the same SVM regression model
bands = confidence_bands(model.B, test_sample_pca, d)

Below, we can see the predicted confidence bands for each of the observations.

We can see some variability among the bands. Moving the cursor over a band in the interactive plot shows its extremes and the ID of the song.

Let's now look at some summary statistics for the bands.

##                  values
## width             34.91
## mean.lower.bound 144.89
## mean.upper.bound 179.80

All bands share the same width of \(34.91\) (i.e. \(2d\)); the mean lower bound is \(144.89\) and the mean upper bound is \(179.80\).

Let’s see a boxplot of the central values of these intervals, compared with the distribution of the target feature tempo in the entire training set.

We can see that the central values of the intervals for the 100 test observations have a narrower range than the entire tempo distribution. The reason is that the model was trained on only 5 observations, so its predictions depend heavily on the values of those observations.

In conclusion, given the uncertainty of the model, returning a prediction in the form of a band can be more useful and informative than returning a single-point prediction.